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ABSTRACT 

We present a new computational approach to the inversion of solar photospheric Stokes polarization 
profiles, under the Milne-Eddington model, for vector magnetography. Our code, named genesis 
(GENEtic Stokes inversion Strategy), employs multi-threaded parallel-processing techniques to har- 
ness the computing power of graphics processing units (gptjs), along with algorithms designed to 
exploit the inherent parallelism of the Stokes inversion problem. Using a genetic algorithm (ga) engi- 
neered specifically for use with a GPU, we produce full-disc maps of the photospheric vector magnetic 
field from polarized spectral line observations recorded by the Synoptic Optical Long-term Investi- 
gations of the Sun (solis) Vector Spectromagnetograph (vsm) instrument. We show the advantages 
of pairing a population-parallel genetic algorithm with data-parallel GPU-computing techniques, and 
present an overview of the Stokes inversion problem, including a description of our adaptation to the 
GPU-computing paradigm. Full-disc vector magnctograms derived by this method are shown, using 
SOLIS/VSM data observed on 2008 March 28 at 15:45 UT@ 

Subject headings: line: profiles — methods: data analysis — polarization — radiative transfer — Sun: 
magnetic fields — Sun: photosphere 



1. INTRODUCTION 

iHald (|1908f ) first inferred the presence of mag- 
netic fields on the sun by observing the Zeeman-induced 
separation of the components of magnetically-sensitive 
spectral lines, the reliable determination of vector fields 
from spectral line observations has played a pivotal role 
in diagnosing solar magnetism. The Stokes vector, I\ = 
[I\,Q\,U\,V\] T , whose elements arc linear combinations 
of measured polarization intensities at a wavelength A, 
provide a convenient way to observe and characterize 
the ef f ects o f magnetic fields on the absorbing medium. 
lUnnol (|1956t ) first described the Zeeman-induced atten- 
uation of the Stokes vector by magnetic fields in the 
framework of the radiative transfer equations neglecting 
magneto-optical effects. The solution was later gener- 
alized t o include ma g neto- optical (Faraday rotation) ef- 
fects bv lRachkovskvl (|1962fl . 

Much information can be directly e xtracted from 
the ob served Stokes vectors themselves; iRees fe Semell 
(|1979f ) developed a center-of-gravity technique for esti- 
mating the longitudinal component of the magnet i c field 
from Stokes I k.V observations, and lRonan et all (|1987f ) 
described an integral method for recovering the longitu- 
dinal and transverse fields from integrated Stokes linear 
and circular polarization profiles. Furthermore, several 
co nvenient weak-field ca l culati on "recipes" can be found 
in lLandi DeglTnnocentil (|1994h . 

The unique specification of the magnetic and thermo- 
dynamic state of the solar photosphere solely from ob- 
servations of the Stokes vector is classified as an "in- 
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verse" problem. These types of problems are often 
computationally complex and may be ill-conditioned. 
Conversely, the forward-modeling of the Stokes vector 
emergent from an assumed model atmosphere is triv- 
ial. This asymmetry can be exploited to interpret the 
observed Stokes vector within the framework of a mag- 
netic model atmosphere, by tuning its configuration 
to minimize (in a least-squares sense) the model de- 
viation from the observed Stokes vector. We collec- 
tively refer to all such s o lution procedures as "inversion 
methods." lAuer et ahl (|1977f ) developed a now tradi- 
tional Stokes inversion meth od based on the Levenberg- 
Marquardt (l-m) algorithm (Levenberg 1944: Marquardt 
Il963h . which was subsequent ly exte nded and improved 
upon by iSkumanich fc Litei fl987). This optimiza- 
tion approach performs a nonlinear least-squares fit of 
a Milne-Eddington model to the observations, return- 
ing the set of atmospheri c parameters that describe the 
polari zed spectral line. iRuiz Cobo fc del Toro Iniestal 
(|1992t ) extended the optimization method by creating 
an inversion code called SIR (stokes inversion based on 
Response functions), which performs the inversion of ob- 
served profiles while simultaneously inferring the strati- 
fication of the model parameters with optical depth. 

More recently, artificial intelligence methods and pat- 
tern recognition approaches have been gaining ground 
in the c omputational methods of spcctropolarimetric 
analysis. iCarroll fc Staudel (|2001[ ) proposed a technique 
based on an artificial neural network (ann), whereby 
a large database of Stokes profiles (either synthetic or 
pre-inverted by some other means) is used to train the 
ANN to recognize the functional relationship between the 
model parameters and the spectral lines in the train- 
ing set. Once suitably trained, the network can gener- 
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alize the relationship to ot her samples not ex plicitly in- 
cluded in the training set. iRees et all (|2000l ) developed 
a technique, based on t he Princ i pal C omponent Analy- 
sis (pca) formulated bv lPearsonl (|190lh . to decompose a 
line profile into their so-called eigenprofiles. The eigen- 
values associated with these eigenprofiles define a point 
in the model manifold, from which the associated model 
parameters can be calculated by interpolation over the 
training set. This met hod has been subseq u ently used 
on real observations by ISocas-Navarro et all ()2001[ ) and 
lEvdenberg eTafl ([2005I ). 

This paper introduces a new approach to the syn- 
thesis and inversion of spectral lines, based on graph- 
ics processing units (gpus), which can rapidly calculate 
full-disc vector magnetic fields at the photosphcric level. 
The technique is based on the combination of a highly- 
parallel genetic algorithm with a computing architecture 
well suited to exploit the many levels of parallelism in the 
Stokes inversion problem. The remainder of this paper 
is organized as follows. Section [2] presents the obser- 
vational data used in this work. Sections [3] and [4] out- 
line our genetic algorithm optimization engine and GPU- 
programming techniques, respectively. Details of our im- 
plementation of Stokes inversion under the assumption 
of a Milnc-Eddington atmosphere are given in Section [SJ 
Results from the analysis presented in Section [5J Finally, 
we offer some outlooks on the future implementation of 
our method in the vector field data reduction pipeline 
for the Vector Spectromagnctograph (vsm), the scanning 
spectropolarimeter instrument package in operation as 
part of the Synoptic Optical Long-term Investigations of 
the Sun (sous) telescope, located at the National Solar 
Observatory atop Kitt Peak, AZ. 

2. DATA & OBSERVATIONS 
This work uses full-disc SOLIS/VSM observations of the 
four Stokes /, Q, U , and V spectra in a 3.45A bandpass 
encompassing Fc I multiplct #816 near 6302A (see Table 
[1]). For this work, we utilize only the absorption line 
at Ao = 6302.5017A, but note that our inversion code 
calculates the full Zceman pattern of an input transition. 
Laboratory wavelengths for this multiplet and for two 
nearby terrestrial O2 absorpt ion features are taken from 
iPierce fe Breckinridge! (|1973[) . 
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Line-formation parameters for Fe I multiplet #816. 
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Figure Q] shows the solar disc as it was observed 
bySOLis/vSM in the Stokes / continuum (rcdward of 
6302. 5A), on 2008 March 28 at 15:45 UT. 

Figure [2] shows sample Stokes profiles taken from a 
penumbral region in NOAA 10988 (marked by the x 
symbol in Figure [J), close to disc-center. The some- 
what low spatial resolution of SOLIS/VSM pixels (1".125 
pixel -1 ) smears the individual Zeeman components of 
the Fe I 6302. 5A line, preventing observation of fully re- 



Figure 1. Solar disc on 2008 March 28 15:45 UT as observed by 
the SOLIS/vsm instrument in the 6302. 5A continuum. The compass 
points to heliographic solar north. The inset shows NOAA 10988 
in more detail. 



solved Zceman lobes, even in sunspot umbrae where the 
splitting is largest. 
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Figure 2. Observed Stokes profiles (normalized by the local con- 
tinuum) for a penumbral pixel (represented by the X symbol in the 
inset of Figure [TJ belonging to NOAA 10988. SOLIS/vsm spectral 
dispersion is A A = 27.1 ml, and the wavelength scale is relative 
to the Fe I A6301.5 line-core wavelength. The narrow, shallow lines 
are terrestrial O2 absorption features, and are ignored in all anal- 
yses. The vertical lines denote the illumination edges of the CCD. 

For the inversion of this dataset, our routines analyze 
only those pixels whose fractional polarization degree, 
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exceeds a threshold set by the polarimetric noise, defined 
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by 



op 



thresh " 



(2) 



where 0Qj/y are the variances of the Stokes polarization 
profiles in a spectrally-quiet region away from the ob- 
served spectral lines, I c is the continuum intensity (de- 
termined in this same spectral window), and Pthresh is a 
proportionality factor. A value of Pthresh equal to unity 
will signal the inversion code to process all pixels with 
polarization degree strictly larger than the noise, while 
a value of ~ 10 typically provides good discrimination 
between active regions and surrounding quiet-sun, as ob- 
served by SOLis/vSM. While the spectral uncertainties 
will vary somewhat between scanlines and pixels, they 
are typically of the order of the VSM polarimetric accu- 
racy (~ 0.1% of the continuum intensity). Pixels with 
polarization degree less than that given by equation [5] arc 
inverted under the weak-field approximation; the longi- 
tudinal field strength is calculated from the separation 
between the HV line-cores, while the field orientation 
is calculated from Stokes Q, U, and V maxima, as in 
lAuer et all (|l~977l) . To perform the inversion, we utilize 
a class of global optimization algorithms known as ge- 
netic algorithms. 

3. A STANDARD GENETIC ALGORITHM 

Genetic algorithms (gas) are a broad class of search 
and optimization algorithms which exploit computa- 
tional analogues to the pr i nciple s of evolutionary biology, 
first proposed bv lDarwinl (|1859f ). GAs operate on (poten- 
tial) solutions which are encoded into a binary represen- 
tation. This requires distinction between the phenotypic 
expression (the parameters of the problem themselves) of 
the genotypic representation (the encoded parameters). 

Genetic algorithms are receiving ever-increasing at- 
tention as powerful tools for search, optimization, 
and pattern recogn ition in fields as far- ranging as 
engineering design (|Karr fc Freeman! [1999), job-shop 
sched uling (iGrefenstettd 119851 ), and stock mark e t ana l- 
ysis dMahfoud fc Manil |1996[): iDiver fc Ireland! (119971) 
iMdntosh et all (|1998[ ). and [Ramirez fc Fuentesl (|2002l ) 
have applied them to s pectral analysi s of nig httime astro- 
nomical observations: iCharbonneaul (|1995l ) created the 
PIKAIA genetic algorithm and demonstrated its utility 
on se veral distinct prob lems of astrophysical importance, 
while lLagg et al.l (|2004[ ) more recently applied the pikaia 
algorithm to the problem of diagnosing magnetic fields in 
the upper chromosphere with the He I triplet at 10830A. 
They are quite robust, and often succeed where more tra- 
ditional methods fail (e.g. multi- modal optimization). 

The GA solution procedure relies on genetic operators 
which are repeatedly applied to a "parent" population of 
candidate solutions to produce a new "offspring" popula- 
tion of solutions, containing better solutions than those 
found in the previous population. The quantity which 
determines whether one solution is better than another 
is called the fitness, evaluated for each candidate solu- 
tion via a fitness function. Here, the fitness function is 
a x 2 -hke merit score describing model deviations from 
observations. After some number of iterations of this 
process, the population will have converged to a small 
neighborhood around candidate solutions which have the 



best fitness scores. Although a comprehensive review of 
genetic algorithms and operators is well beyond the scope 
of this paper, we present here a simple overview of basic 
GA properties. An A-bit binary string is used to encode 
the j th model parameter, 



b& = b^b[ j) bi j) 



J N-1 



(3) 



where — [0, 1], for each free parameter of the model. 
The convention adopted here is that bit is the most- 
significant bit, while bit N — 1 is the least-significant bit. 
If the j th real- valued free parameter, p j7 is constrained 
such that 



Uj < Pj < Vj 



(4) 



then each binary string can be decoded into a real 
floating-point value via the transformation: 



p j =£- 1 (b^)=u j 
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The genetic algorithm population therefore consists of 
Ap p binary strings of the form: 



MxN bits 



1000010101001001 • • • 1010010100100101, 



(0) 



N bits 



N bits 



where M represents the number of free parameters of 
the model. The following list elucidates the notation of 
Algorithm [TJ which presents a pseudocode listing of the 
basic genetic algorithm functionality. 

• Pt denotes the population of candidate solutions 
at generation t. Each (real-valued) candidate so- 
lution represents a single realization of a plane- 
parallel, Milnc-Eddington m-e model of the line 
formation region. We seek the candidate whose 
forward-modeled Stokes vector shows minimum de- 
viation from the observed profiles. 

• Gt denotes the population of binary-encoded can- 
didate solutions at generation t. 

• F t denotes the population fitness at generation t, 
and is generated by application of the evaluation 
operator, J 7 , to Pj. Better candidate solutions will 
have smaller fitness values. 

• The operator S represents sampling without re- 
placement. The sampling is stochastically biased 
such that better candidate solutions are (proba- 
bilistically) more frequently selected from the pop- 
ulation. 

• The operator £ represents the encoding of a can- 
didate solution into its binary representation. An 
A-bit binary string gives a numerical resolution of 
Api = (vi — Ui) / (2 N — l), if Ui and Vi are the 

lower and upper bounds, respectively, of the i th 
model parameter. The corresponding decoding op- 
eration (equation [5]) is denoted by £ _1 . 
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The operator 1Z represents pair- wise recombination 
of the population. Two candidate solutions (se- 
lected by S) swap segments of their binary repre- 
sentations to create the representations of two new 
offspring solutions. 

The operator M represents a probabilistic muta- 
tion for each member of the offspring population. 
Each bit on the binary string has a small prob- 
ability of being flipped to its complementary bit. 
The mutation operator used in this work was de- 
signed to favor local exploration by mutating less- 
significant bits more frequently. 



Algorithm 1 Pseudocode for a standard genetic algo- 

rithm. 

t = 

Create Po 

Evaluate Fo «- -F[Po] 
while (not termination-condition) 

t <-t + l 



do 



Select : 
Encode : 
Recombine : 
Mutate : 
Decode : 
Evaluate : 
end while 



Pi 
G t 
G t 
Gt 
Ft 
Ft < 



■5[P t _i, 
-£[Ft] 

- n[G t ] 

-M[G t ] 
■S-HGt] 

nF t ] 



We use GPU-computing techniques to offload data- 
parallel, compute- intensive calculations with high arith- 
metic intensity to a massively-parallel high-speed com- 
pute architecture. For this work, we employ an extension 
of the genetic algorithm inv ersion engine developed and 
described in IHarkerl (I2009f ) and a single nvidia Tesla 
Cf060 gpu. We utilize nvidia's Compute Unified De- 
vice Architecture (cuda) programming interface to han- 
dle data transfer and computations the GPU. The next 
section gives a brief introduction to structured parallel 
programming with CUDA. 

4. THE CUDA PROGRAMMING MODEL 

The CUDA programming interface consists of a minimal 
set of extensions to the standard C/C++ language which 
allow a properly-constructed data-parallel algorithm to 
execute among the thousands of concurrently running 
threads resident on the GPU. This is accomplished by 
a kernel function, which controls the computations and 
memory accesses to be performed at the thread granu- 
larity. The kernel function is callable from another user- 
defined function (the host function) , which controls data 
transfers to/from the GPU and is responsible for launch- 
ing the kernel execution. 

A traditional criticism of the application of genetic al- 
gorithms to real-world problems is the fact that they 
tend to be slow, since they process a potentially large 
set of candidate solutions. The CUDA programming in- 
terface allows NVIDIA GPUs to execute such data-parallel 
algorithms, by organizing the computations into groups 
of concurrently-executing threads called blocks, while 
these blocks arc themselves organized into groups of 
concurrently- executing grids. Mirroring this thread or- 
ganization in the genetic algorithm population allows us 
to write a kernel function that dedicates each block to 



the calculations for a specific member of the popula- 
tion, while all threads in each block are dedicated to 
the per- wavelength calculations required by the spectral 
line synthesis and fitness function evaluation for each 
population member. The power of this approach is ev- 
ident; in a serial genetic algorithm, the total number 
of wavelengths to synthesize (per generational iteration 
of Algorithm [TJ) is N\N pop , where N\ is the number of 
wavelengths spanning the spectral line, and N pop is the 
size of the population. While this can only be done one 
wavelength at a time, one population member at a time 
for the serial algorithm, using a CUDA-capable GPU al- 
lows all N\N pop calculations to be done simultaneously, 
constrained only by the physical limitations of the GPU 
hardware itself (i.e., the maximum possible number of 
concurrently-executable threads and total onboard mem- 
ory). 

While a comprehensive review of the CUDA architec- 
ture is outside the scope of this paper, we refer the inter- 
ested reader to the CUDA Programming Guide and Soft- 
ware Development Kit (SDK), curr ently available from 
http : / / www . nvidia . com/get c uda| 



4.1. Thread & Memory Hierarchy 

The CUDA programming model requires an execution 
configuration, whereby the thread distribution is and or- 
ganized into thread blocks, and similarly how individual 
thread blocks are organized into a grid. Figure [3] shows 
schematically how an example execution configuration is 
indexed, so that each thread can be uniquely identified 
in the grid. 

Threads from the same thread block may communicate 
with each other by utilizing shared memory (see below), 
although one must be careful to structure the program in 
such a way as to avoid memory access conflicts between 
threads. Global memory allows threads from different 
blocks to communicate with each other. Once the execu- 
tion configuration is defined, the CUDA kernel function is 
launched by invoking it with a special syntax, shown in 
Algorithm^ Global memory for the input data and out- 
put result is allocated via cudaMalloc, and the data is 
copied from CPU memory to the allocated GPU memory 
with cudaMemcpy. The kernel function is invoked with 
nBlocks blocks of nThreads threads, and the desired re- 
sults are copied back from device to host memory, again 
via cudaMemcpy. Please note, however, that an actual 
production-grade kernel invocation is considerably more 
involved than the toy example presented in Algorithm [5J 

Algorithm 2 An example C-style CUDA kernel invoca- 
tion. 

cudaMalloc( (void**)&dResult , sizeof (hResult) ); 
cudaMalloc( (void**)&dData, sizeof (hData) ); 
cudaMemcpy( dData, hData, sizeof (hData) , 

cudaMemcpyHostToDevice ) ; 
MyKernel«<nThreads , nBlocks»>( dData, dResult ); 
cudaMemcpy ( dResult, hResult, sizeof (hResult) , 

cudaMemcpyDeviceToHost ) ; 

4.2. Memory transfers between CPU and GPU 

An important principle of GPU-computing is to min- 
imize the amount of data that is copied between CPU 
memory and the global memory on the GPU. Band- 
width across the PCI-Exprcss (pci-e) bus connecting 



■5 



C Program 
SntfurntLiI 
Execution 



Serial co-3e 



Parallel kernel 
HernelQ*«"^| > 



5>:ri:i1 i n li: 



Parallel kernel 
Kernell^--^>=-( ) 



Host 



SridO 



BKrcMO.Ol Blade [1,0] 9krtMZ. QJ 



1, J. ,1 U KICK I 

IRHHIfV 



HOLpt 



<3r rtl 



Block CO. 01 BtackU.Ot 




Figure 3. CUDA thread hierarchy ad heterogeneous computing 
paradigm. CUDA kernel functions are interleaved with the host 
code, and multiple (potentially different) kernels may be launched 
from a host. The <<<>>> syntax is one of the CUDA exten- 
sions to the C language; it is used to specify the size and geome- 
try of the execution configuration. Reproduced, with permission, 
from the CUDA Programmi ng Guide, courtesy of NVIDIA corpora- 
tion. {NVIDIA Corp. 20Tiri . 



CPU memory to GPU memory is much lower than the 
shared memory bandwidth, as can be seen in Table [2J 
The table shows the results of an initial bandwidth test 
performed on the GPU used in this work; these figures 
characterize our single realization of hardware compo- 
nents, and will vary depending on the exact system com- 
ponents and hardware specifications. The theoretical 
maximum bandwidth for the Tesla C1060 is 102 GB s _1 
(INVIDIA CorD.|[20T0l) . while we achieve « 72% of this 
theoretical limit in our hardware configuration. 

To maximize the efficiency of the data transfer to 
the GPU device, we modified our genetic algorithm by 
rephrasing the representation from the traditional bi- 
nary arrays to unsigned short integer arrays. For exam- 
ple, consider a binary encoding of M parameters with 
N = 1Q bits each. Instead of defining the binary geno- 
type string as char gene [M*N] , with gene [i] = or 1 
(see equation ([6])), the genotype is defined as unsigned 
short gene [M] , since each unsigned short is internally 



Table 2 

Initial bandwidth tests for the NVIDIA Tesla C1060 used in this 
work. 



Transfer Direction 



Bandwidth 
(GB s- 1 ) 



CPU-to-GPU (global) 1.4634 
GPU-to-CPU (global) 1.1840 
GPU-to-GPU (shared) 73.3165 



represented by 16 bits. The former encoding technique 
requires 8 x M x N bits of storage, while the latter re- 
quires only M x N bits, representing a savings in storage 
space (and transfer times) of a factor of 8. Furthermore, 
the traditional encoding technique is incredibly wasteful, 
since only the least-significant bit of each 8-bit char is 
needed to encode a or 1, meaning only 1/8 of the data 
transferred to the GPU memory would actually be use- 
ful. In contrast, our modified binary genotype encodes 
the same amount of useful information, but occupies a 
fraction of the space in memory 

Transferring the binary encoded parameters to the GPU 
allows the threads to use the high-speed shared memory 
to decode the unsigned short integers to the real floating- 
point values needed for the model synthesis on the GPU, 
so this is always a net gain in performance over decoding 
the parameters serially and transferring the real floating- 
point values to the GPU. 

This modified binary genotype also allows our genetic 
operators to use the bit-shifting and b it-masking tech- 
niques of lluspa fc Scaramuzzinol (|2001l) to operate di- 
rectly on the internal binary representation of the un- 
signed short integers. These bit-manipulation techniques 
yield faster genetic operators than would be used for the 
traditional binary representation, which typically involve 
several nested loops for analysis of each bit in the binary 
array. 

We have presented the computational aspects of our 
GPU code in this section, so we now turn to the imple- 
mentational details of our inversion and synthesis rou- 
tines to be run on the GPU. 

5. IMPLEMENTATION OF STOKES INVERSION 

The polarized radiative transfer equations (prte) de- 
scribe the modification of the Stokes vector of a beam as 
it propagates, in the direction s, through some medium. 
Formally, it is given here as 



^=K A (I A -S A ), 

as 



(7) 



where /j, is the cosine of the heliocentric angle, and K A 
and S A are the propagation matrix and source function 
vector, respectively. 

If we model both continuum and line absorption pro- 
cesses, then 

K x = l+vo&\, (8) 

where rj is the ratio of line-to-continuum absorption co- 
efficients, and the matrix €>a includes absorption and 
magneto-optical effects, parameterized by the magnetic 
and thermodynamic properties of the model atmosphere. 
Expre ssions for these matrix elements may be found in 
(e.g.) lLandi Degl'Innocenti fc Landolfil (|2004f ) and ref- 
erences therein. These matrix elements arc functions of 
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the Voigt and Faraday- Voigt line profiles, calculated to 
hi gh accuracy via the rational function approximation 
of iHui et al.l (|1978l ) . This formulation is extremely well- 
suited to evaluation on a GPU, due to its high arithmetic 
intensity (ratio of math operations to memory accesses) . 

The source function vector describes the ratio of emis- 
sion to absorption in the beam, and includes both con- 
tinuum and line contributions, so that 



S A = S' c e + 77oS';*Ae, 
e = (l,0,0,0) T , 



(9) 
(10) 



where S c is the continuum source function and Si is 
the line source function. Assuming local thermodynamic 
equilibrium reduces both continuum and line source func- 
tion to the Planck function at the local temperature, 
B\(T). Adopting a Milnc-Eddington (m-e) relation for 
the source function variation as a linear function of op- 
tical depth, 



S c = S l = B X {T) = So + Sir = S (l + /V) 



(11) 



where /3o = Si /So represents the inverse of the character- 
istic length scale over which the source function changes 
appreciably, the PRTE admits an analytical solution for 
the model Stokes profiles, l¥ , given here as 



If 



P = 



(1-/3)1 + /3(1 + 7? *a) 1 



l + mA> ' 



(12) 
(13) 



where 1 is a 4 x 4 identity matrix and I c denotes the 
observed local continuum intensity. This m-e solution 
is characterized by magnetic and thermodynamic pa- 
rameters assumed to be constant with depth through 
the line-formation region, here collectively represented 
by the model vector of free parameters, p. Since we 
do not consider gradients with respect to optical depth 
of any parameter except the source function, the M-E 
atmosphere represents a kind of integrated behavor of 
the true parameters over the height of line-formation 
(jOrozco Suarez fc Del Toro Iniesta!l2007t ). Formally, the 
k th candidate solution in the genetic algorithm represents 
a single model vector 



Pfe = [B,i/j,x, Ao,adc, AX d ,t]o,S ,/3 ) 



(14) 



where B is the magnetic field strength, ip is the incli- 
nation of the field with respect to the observer's line of 
sight, x is the azimuthal angle of the field, Ao is the 
line-center wavelength of the spectral line, a^c is the 
atomic damping constant of the spectral line, AA d is the 
Dopplcr line-width, j]q is the line-to-continuum opacity 
ratio, and So and /3q are the linear source function coef- 
ficients such that the continuum intensity is given by the 
Eddington approximation as 



Ic = S + fJ-Si = S (1 + /JqaO i 



(15) 



To maintain generalizability in the inversion code, the 
full Zeeman pattern of the spectral line is calculated at 
the start of inversion. This is done only once, so the 
overhead incurred is negligible, considering the flexibil- 
ity gained. The particular spectral line to be synthesized 
is configurable by the user, and the code contains all the 



necessary generalizations of the M-E solutions to allow 
it to function with arbitrary photosphcric spectral lines. 
The fitness function to be minimized by the genetic al- 
gorithm is the following x 2_ hke merit function, which 
quantifies the fit of the Stokes vector (I M ) generated by 
the model p/j (with v = 4N\ — M degrees of freedom) to 
the observations (1°), given here explicitly as 



X 2 (Pk 



1 



[if^O-lf^p*)] 2 , (16) 



where i = I,Q,U,V. The quantities wij are weighting 
factors, traditionally used to adjust the contribution of 
different wavelengths to the total deviation across the 
spectral line, and are discussed further in Section 15.41 
Here, the j index is dropped from the weights; they 
are taken as constant over the spectral line, but distinct 
for each of the four Stokes profiles. Using this form of 
the merit function allows the calculation of uncertainties 
(over the \ 2 hypcrsurface) in the recovered model pa- 
rameters by straightforward techniques, once the genetic 
algorithm has converged. 

It is an important principle of optimization to work in 
the smallest possible parameter space; reducing the di- 
mensionality of the model vector will increase the speed 
of the inversion and enhance the stability of the algo- 
rithm, since there are fewer (potentially degenerate) pa- 
rameters to simultaneously determine. The next section 
describes some of the techniques used in our approach to 
reduce the dimension of the parameter space and there- 
fore enhance the efficiency of the genetic search. 

5.1. Reduction of the model manifold 

Although the line-center wavelength can be a free pa- 
rameter of the fit, GENESIS instead directly uses the ob- 
served line-center wavelength as measured from the core 
of the Stokes / profile. After calibrating the observed 
wavelength scale by measuring the separation of the two 
terrestrial O2 absorption lines near 6302A, a center-of- 
symmetry approach, 

A sym = argmin S(A,) = £ |/(A i+J -) - (17) 



is used to determine a rough estimate of the line-center 
wavelength. This estimate is refined by bracketing X sym 
with a wavelength triplet and calculating the minimum 
of its uniquely-fit polynomial. 

The m-e model atmosphere specifies a source func- 
tion linear in optical depth, characterized by its value 
at the r = photospheric surface (So) and its in- 
verse characteristic length scale (A))- Inspection of the 
Unno-Rachkovsky solutions reveals a simple normaliza- 
tion schem e that eliminates t he dependence on So, as was 
adopted in lAuer et al.l (|1977| ). Here, we define a modified 
Stokes vector, 



If ^/ c e-lf, 



(18) 



which consists of the Stokes / line depression and Stokes 
Q, U, and V profiles. Note we choose to leave (3q (which 
influences the amplitude of the synthesized Stokes pro- 
files) as a free parameter of the fit. 



7 



5.2. Limited spatial resolution & scattered light 

To account for limited spatial resolution, a new free pa- 
rameter is introduced; the magnetic fill-fraction a repre- 
sents the fractional pixel area occupied by the magnetic 
field. The remainder of the pixel area (1 — a) is assumed 
to be field-free. The Stokes vector then becomes a linear 
superposition of the magnetic and non-magnetic profiles, 
weighted by a, 

If <- alf + (1 - a)ir&- (19) 

The non-magnetic profile is assumed to be a quiet- 
sun Stokes / profile from the local surroundings. Us- 
ing a locally-averaged quiet-sun profile would require 
breaking one of the most advantageous properties of 
the algorithm, namely that each scanline/pixel can be 
inverted independently of the data from neighboring 
scanlines/pixcls. Furthermore, this approach requires 
a tremendous amount of disk I/O, and can be quite 
slow. To maintain a totally independent scanline inver- 
sion, we have taken a different approach; the center-to- 
limb variation (CLV) of the quiet-sun Stokes I profile 
has been measured (Harvey, 2010, private communica- 
tion) and parameterized as a pure Voigt function char- 
acterized by its amplitude, full-width at half-maximum 
(FWHM) and atomic damping parameter. Gaussian and 
Lorentzian components of the profile can be derived from 
the FWHM. A 3 rd -order polynomial is fit (as a function 
of heliocentric /i) to the CLV of each of these parameters. 
The resulting polynomial fits are used within the inver- 
sion to calculate the appropriate values of the quiet-sun 
parameters as a function of disc position for each pixel. 
The non-magnetic profile is then synthesized from these 
parameters, centered on the observed Stokes / line-center 
wavelength. 

We account for scattered light at the ~5% level by first 
correcting the measured continuum. The baseline of the 
observed Stokes / profile is increased, and subseq uently 
renormalized to the corrected continuum, following |GrayJ 
(f2005h . 

5.3. Thermodynamic parameters 

It is well known that there exists some level of degen- 
eracy between the magnetic and thermodynamic param- 
eters, with respect to their influ ence on the model pro - 
file line-shapes. Figure 11.1 of Idel Toro Iniestal (|2003f ) 
shows an explicit example of this effect; it is not al- 
ways clear whether one combined magnetic and thermo- 
dynamic configuration leads to a better fit to the observa- 
tions than ano ther, even if the con figurations are notice- 
ably different (|Borrero et al.|[20Tll) . The convention typ- 
ically adopted by inversion practitioners is that the ther- 
modynamic model parameters are of less importance to 
the final quality of the fit than the magnetic parameters. 
ISkumanich fc Litesl (|1987ft suggested that in order to find 
a robust fit, some of the thermodyn amic parameters must 
be fixed prior to the inversion, and lBorrero et all (|2011[ ) 
investigated the effects of holding the atomic damping 
at a constant value during their inversions. They found 
negligible differences between the recovered vector mag- 
netic fields. It is not clear, however, that this proce- 
dure is generally acceptable for o bservations with muc h 
higher spectral resolution than in iBorrero et al.l (|2011l ) . 
We have decided not to hold fixed any of the thermody- 



namic parameters, and have implemented a "prc-fitting" 
initialization in which we fit a non-magnetic Stokes / 
profile to the observed Stokes / profile, using a simple 
Levenberg-Marquardt algorithm. This returns values for 
the atomic damping parameter, etd c , Doppler width A Ad, 
line-to-continuum opacity ratio, 770, and source function 
parameter, (3q. Assuming a non-magnetic model for a 
(potentially) Zeeman-broadened profile will, of course, 
lead to errors in the derived thermodynamic variables. 
However, these values are utilized only to constrain the 
parameter space to sensible ranges within which the GA 
can search. 

5.4. Weighting scheme 

Since the Stokes / profile will always have much larger 
signal strengths than the polarization profiles, deviations 
between the observed and synthesized Stokes / profiles 
will dominate the x value, essentially causing the algo- 
rithm to fit the model atmosphere solely to the Stokes 
/ intensity profile. To mitigate this effect, we equalize 
the importance of all four Stokes profiles by using an 
"inverse-max" weighting scheme to ensure that all devi- 
ations contribute roughly equally to the calculated \ 2 . 
The weighting scheme is given here explicitly as: 

w I =(l c -IS hs )~ 1 (20) 
WQ = (max |Qf a |) -1 (21) 

i^ir 1 (22) 

w v = {max |U A obs |r\ (23) 
where /Q bs is the observed line-core intensity. 

5.5. Model manifold boundaries 

The genetic inversion will be most efficient in a pa- 
rameter space that has the smallest physically-realistic 
domain for each parameter. By seeding the initial popu- 
lation of the genetic algorithm with some specific heuris- 
tic knowledge of the problem at hand, we can both ensure 
that we start with at least a few high-quality solutions, 
and suitably restrict the domain of the searchable param- 
eter space. Both scenarios will accelerate the convergence 
and improve the final accuracy of the solutions. 

Here, the initial population includes representations 
of magnetic field vectors gene r ated from the weak-field 
approximations in lAuer et al.l (|1977f ) , as well as from a 
functional relationship between the field geometry and 
integrate d measures of the S tokes polarization profiles 
found in iRonan et alj (|1987f ). The longitudinal field 
str ength estimate d from the center-of-gravity approach 
by IRees fe Semell (|1979D is also seeded into the initial 
population. In the case of full-disc inversions, where ev- 
ery on-disc pixel is inverted, the trivial non-magnetic so- 
lution is seeded into the population as well. 

The field inclination domain is naturally restricted to a 
single polarity, ip € [0, 90]° or [90, 180]°, initialized to be 
in agreement with the order of the blue- and red-lobes of 
the observed Stokes V profile. The seed fields are checked 
to ensure that they are consistent with this polarity. 

Additionally, we use empirical knowledge derived 
specifically from SOLis/vSM data to discriminate be- 
tween different magnetic structures observed on the disc. 
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NOAA 10988, 28 March 2008 15:45 UTC 




X [arcsec] 

Figure 4. Line morphology classification for NOAA 10988. The 
image depicts Stokes I line-core intensity, overlaid with red, blue, 
and yellow contours demarcating umbral, pcnumbral, and plage 
regions, respectively. 

Spatial pixels are labeled as likely belonging to the struc- 
tures seen in Figure^! based on their observed continuum 
and line-core inten sities relative to the average quiet-sun 
profile. Following Hestroffcr & Magnan (1991), we ap- 
ply a simple correction for limb-darkening to flatten the 
observed intensity profile in pixels far from disc-center, 
before a label is assigned. The details of this empirical 
classification scheme (and the subsequent compartmen- 
talization of the B-a subspace) are given in Table [3J 



Table 3 

Active region structure discrimination parameters. 



discriminator 


Umbra 


Penumbra 


Plage 




< 0.65 


[0.65-0.90] 


> 0.90 




< 0.69 


> 0.69 


> 0.85 


B[G] 


[1000-3500] 


[0-2500] 


[0-1500] 


a 


[0.25-1] 


[0.25-1] 


[0-0.5] 



5.6. Population initialization 

As described in Section [3J the genetic algorithm works 
by continually evolving good solutions out of a popula- 
tion of candidate solutions, represented by the reduced 
model vectors 

Pk = [B, if), x, ado AAd, ry , Po, a] T , (24) 

with the number of free parameters, M = 8. Hardware 
limitations dictate the maximum size of the population; 
the Tesla C1060 GPU contains 30 multiprocessors, each 
of which is composed of 8 stream processors. Therefore, 
the total number of thread blocks (candidate solutions) 
which can be simultaneously processed is iVp 0p = 30x8 = 
240. 

It is traditional to initialize the genetic algorithm with 
a random (but bounded) population to ensure enough 
diversity for the genetic operators to produce meaning- 
ful evolution. However, we instead generate the initial 
population (of non-seeded candidate solutions) by re- 
peatedly sampling from a Sobol sequence generated via 



the lBratlev fc Foxl (|1988l ) algorithm. An M -dimensional 
Sobol sequence is a guasi-random sequence that is maxi- 
mally self-avoiding; the points in the sequence tend to 
(roughly) evenly distribute themselves throughout the 
M-dimensional hypercube [0,1] M . This property gives 
robust, even coverage of the parameter space without 
placing the initial population on a regularized grid, which 
would completely inhibit the search action of the recom- 
bination operator (1Z). In addition, this approach elimi- 
nates chance clustering in the initial population, thereby 
avoiding the processing of redundant candidate solutions. 
The final benefit of using quasi-random initialization lies 
in the efficiency of population restarts; when the popu- 
lation has converged to some self-monitored degree, all 
but the best individual(s) are re-initialized. Generat- 
ing new candidates according to a quasi-random schedule 
guarantees that we will be refreshing the "gene pool" in 
the most efficient way, by using the self- avoidance prop- 
erty to automatically ignore previously-sampled regions 
of the parameter space. We exploit this property to ad- 
dress any issues related to premature/false convergence; 
every -/V samp iterations of the genetic algorithm, we dis- 
card the worst half of the population ( "dead solutions" ) 
and reinitialize them according to the Sobol mechanism 
described above. In practice, for a maximum number 
of generations N gen over which to evolve, we note that 
^Vsamp ~ A^gen/4 provides a good balance between deep 
genetic search and the introduction of new genetic ma- 
terial. We allow the population to evolve for -/V gen = 100 
generations. 

6. RESULTS & DISCUSSION 

Proceeding along each scanline and performing the 
GENESIS inversion on the corresponding spectra for each 
pixel builds a map of the model parameters over the full- 
disc. Figure [5] shows the magnetic field strength, inclina- 
tion, azimuthal angle, and fill-fraction over the full-disc 
as inferred by the GENESIS inversion. The inset shows 
NOAA 10988. 

The 0° reference direction for the azimuthal angle is 
along the horizontal axis of the image. The field has not 
been resolved of the 7r-ambiguity inherent in all Stokes in- 
version techniques, hence the antisymmetric color wheel 
Nevertheless, the radial structure of penumbral fields is 
well-determined. The fill-fraction displays the expected 
behavior of values « 1 for the umbral and pcnumbral 
regions, with a decline to values < 0.5 for surrounding 
plage regions. 

The statistical spread of the final population provides 
a convenient means to generate initial estimates of the 
uncertainties associated with the fitted model parame- 
ters. Measuring the spread around the identified opti- 
mum gives population uncertainties, from which we esti- 
mate the gradient of the x 2 manifold: 

[v,'(p°-)] J ^ 2(p °;2 Pl ; 2 "" ) . <*» 

where is selected from the best of the final popula- 
tion, chosen such that we avoid numerical difficulties in 
the calculation of the finite differences (i.e. ratio of two 
very small numbers). Using this estimate, we bootstrap 
the derivatives to higher accuracy by Richardson extrap- 
olation to zero stepsize within Neville's algorithm (see, 




Figure 5. (top left): field strength, (top right): field inclination, (bottom left): field azimuth, (bottom right): fill-fraction. The vertical 
stripe down the center of the azimuthal angle image is an artifact of the dual Rockwell camera system in SOLIS/vsm, and do not appear in 
newer Sarnoff camera data. The compass points to solar north. 



e.g.. iPress et al.lll988ft . The Hessian matrix, H, is thus 
calculated as the Jacobian of the resulting gradient vec- 
tor jjnd_the_jwiance of the i th model parameter is given 
bv lSanchez Almeidal (|1997j ): 



H 



x 2 (p opt ) 

M 



(26) 



This evaluation requires a (potentially) large number of 
fitness function calls, but is only done once per pixel, so 
the small overhead incurred is outweighed by the ben- 
efit of having an accurate error estimate for the model 



parameters. 

Table [4] presents an estimation of the uncertainties in 
the fitted model parameters determined in this man- 
ner. The uncertainties are broken down according to the 
structure discrimination method in Section [531 F° r each 
structure, we calculate the average uncertainty, Aj, and 
corresponding standard deviation, . The table shows 
the quantity A^icA; , so that Ai+ScAi represents an up- 
per limit to the model parameter uncertainties in 99.73% 
of the pixels belonging to each active region structure. 
For example, 99.73% of umbral pixels have uncertainties 
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Table 4 

Distribution of uncertainties in Fe I A6302.5 magnetic parameters, 
broken down by structure. 



Structure 


A B ± <ta b 


A 7 ± (TA 7 


A x ± fTA x 


At, ± (TA Q 




(G) 


(°) 


(°) 


(xl(U 3 ) 


Umbra 


1.3±2.9 


0.45±0.69 


0.73±0.89 


2.4±5.9 


Penumbra 


2.4±4.5 


0.62±0.80 


0.73±0.91 


2.5±5.4 


Plage 


1.2±2.7 


0.65±0.79 


0.84±0.96 


2.1±3.6 



in field strength of 10 G or less. These quantities are 
not meant to represent the uncertainties in any particu- 
lar pixel or structure, but instead characterize the overall 
distribution of uncertainties derived from the inversion. 
It should be noted, however, that these uncertainties are 
derived solely from the topology of the x 2 hypersurface, 
and do not include systematic errors resulting from the 
assumptions and/or limitations of the Milne-Eddington 
model, which is sometimes not a suitably accurate de- 
scription of the real solar atmosphere. 

Table [3] shows a speed comparison between the serial 
and GPU-accelerated versions of the genesis inversion 
code, for both iron lines, averaged over 50 independent 
runs. The total evaluation time (AT eva i) is the time spent 
synthesizing the model spectra and evaluating the var- 
ious contributions to equation [THJ For the serial code, 
the evaluation time consumes a strong majority (65.7% 
and 78.5% for Fe I A6302.5 and Fe I A6301.5, respec- 
tively) of the total algorithm runtime, AT run . Use of the 
GPU as a co-processor has reduced the total evaluation 
time to only 7.1% and 12.1%, respectively, of the GPU- 
accelerated total runtime. The total runtime is quite 
stable for both the serial and GPU-accelerated versions of 
the code. The table shows la standard deviations, which 
demonstrate that the serial and GPU- accelerated versions 
can vary in their total runtimes by up to a minute. The 
GPU-accelerated evaluation time further shows why the 
GPU chip architecture is so amenable to parallel compu- 
tations. The variation in the average time spent inter- 
acting with the GPU is less than a second; this stability 
is precisely due to the many-cored nature of the GPU, 
which uses a very efficient thread scheduler to keep the 
cores of each multiprocessor optimally busy. 

Overall, the GPU-accelerated Fe I A6302.5 inversion is a 
factor of 2.6 times faster than the serial version, with an 
increase to a factor of 5.2 for the Fe I A6301.5 non-normal 
triplet. The total synthesis and evaluation time shows 
just how powerful the CUDA programming paradigm can 
be, performing the computations 23.7 and 33.3 times 
faster than the serial versions of the code can manage. 
Since the non-normal triplet Fe I A6301.5 has four con- 
tributions (from levels with equal AMj) to each of the 
three Zeeman components, a greater proportion of work 
is done on the GPU. This highlights another principle of 
GPU computing; the more work you assign to the GPU, 
the more efficiently it can perform it. 

7. CONCLUSIONS 

We have described a novel computational approach to 
the inference of photospheric vector magnetic fields from 
observations of the Stokes polarization profiles. Our new 
inversion code, named GENESIS, is capable of quickly pro- 
ducing full-disc spatial maps of the magnetic structure 



of the solar photosphere observed by the SOLis/vSM in- 
strument located atop Kitt Peak at the National Solar 
Observatory, in a fraction of the time required by a sim- 
ilar serial technique. The inversion code is capable of 
recovering magnetic fields with uncertainties on the or- 
der of 0.5%, with errors in the field orientation of a few 
degrees. Fill- fractions are recovered with uncertainties 
on the order of 2%. Currently, the code is only capable 
of inverting a single line at a time, though we plan to 
investigate the extension to the simultaneous inversion 
of both lines of the Fe I A6302 multiplet. 

We have shown the technique to be amenable to the 
reduction and analysis of large volumes of spectropolari- 
metric data. To this end, we arc currently investigating 
the the assimilation of the GPU hardware and special- 
ized CUDA-based algorithms into the SOLis/vSM vector 
field pipeline. Our long-term goals for this work are to 
provide near real-time vector magnetic fields to the sci- 
entific community. Increasing the cadence of SOLis/vSM 
vector data products will also allow us to support and 
complement observations taken by the Solar Dynamics 
Observatory (sdo) Hclioseismic and Magnetic Imager 
(hmi), which produces full-disc vector magnetograms at 
4096 x 4096 resolution with a cadence of approximately 
12 minutes. 

The GPU programming paradigm is highly scalable; a 
compiled CUDA application can execute on any CUDA- 
capable device, subject to hardware limitations. The 
thread scheduler will automatically allocate the appro- 
priate number of thread blocks to the stream processors. 
Coupled with the Message Passing Interface (mpi) to par- 
allelize over scanlines (i.e. independent scanlines are in- 
verted independently, utilizing their own distinct GPU), 
this could lead to incredibly fast (i.e. near-realtime or re- 
altime) full-disc inversions with a modest number of CPU- 
GPU pairs. Improvements in GPU hardware are steadily 
advancing; the Fermi architecture is the recent successor 
to the Tesla architecture, offering up to 512 CUDA cores, 
larger-capacity memory banks, and increased floating- 
point performance. With the next-generation Kepler ar- 
chitecture on the horizon, the prospects for accelerated 
solar data processing are indeed promising. Finally, as 
GPU computing matures, we expect to extend this ap- 
proach to better hardware, with a cautious eye toward 
spectropolarimetric analyses of data recorded by the up- 
coming Advanced Technology Solar Telescope (atst). 
The volume of data expected from this next-generation 
observatory will greatly exceed that of the current gen- 
eration, requiring new and faster techniques to properly 
handle and reduce the observations in a timely man- 
ner. We feel the integration of GPU-accelerated data- 
reduction techniques will be key for the analysis of such 
large datasets, and may make available important (near) 
real-time information on the photospheric vector mag- 
netic field to the space-weather forecasting community. 
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Table 5 




Timing 


profiles for GENESIS inversions 


of Fe I AA6301. 5,6302. 5 


mode 








(min) 


(min) 




6301.5A 6302. 5A 


6301.5A 6302. 5A 


SERIAL 


187.83±0.69 84.56±0.26 


147.54±0.68 55.53±0.17 


GPU-ACC. 


36.40±0.15 32.82±0.36 


4.43±0.01 2.34±0.01 
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